It's 2:00 a.m. and, once again, I'm dealing with a dead runner. A deployment has failed, and in the GitHub Actions interface my persistent self-hosted runner shows as ‘offline’ with a stale registration token. I have been here before: a home server running self-hosted runners 24/7, and the runners eventually lock up, drift from their original configuration, or fill up the disk. I decided I needed to treat self-hosted runners as disposable, so that after every job I can spawn a fresh one. I have a Proxmox VE cluster that was mostly sitting there gathering dust, so I started using it for an ephemeral self-hosted runner setup built from LXC templates, Terraform, and a simple webhook. Honestly, it has changed how I run my homelab CI/CD.
Quick Summary
- Single-use GitHub Actions runners: every job gets a fresh LXC container spun up on Proxmox.
- Scale down to zero: No persistent runners sit idle.
- Automatic token rotation ensures that JIT tokens are always up to date.
- The entire workflow is triggered by a GitHub webhook, which kicks off a Terraform apply.
Understanding ephemeral self-hosted github runners proxmox Architecture
Before typing anything into a terminal, it is worth understanding why the ephemeral model resolves so many homelab headaches.
Ephemeral vs. Persistent Runner Lifecycles
A persistent runner is essentially an old server being patched over and over: it accumulates state, runs out of disk space, and eventually loses its registration token. An ephemeral runner's lifecycle is very short; it comes into existence, takes exactly one job, and then dies. Zero state, zero drift. In my homelab, instead of maintaining a fleet of long-lived VMs, I focused on making it fast to provision fresh LXC containers from clean templates. That comes down to two pieces: cloning a clean template quickly and injecting a just-in-time (JIT) registration token.
High-Level Webhook Trigger Workflow
The flow is straightforward. A GitHub App sends a workflow_job event to a small listener that is always running on my Proxmox machine. Each time the listener receives a queued event, it runs a Terraform plan and apply to clone a new LXC container from the template, injects a JIT registration token obtained from the official GitHub API, and powers the container up. Once the container is ready, the runner registers itself and picks up the job. After the job completes, the container self-destructs.
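For reference, a trimmed workflow_job payload looks roughly like this (IDs and repository name are illustrative); the listener only cares about the action field and the labels array:
{
  "action": "queued",
  "workflow_job": {
    "id": 987654321,
    "run_id": 123456789,
    "status": "queued",
    "labels": ["self-hosted", "lxc-runner"]
  },
  "repository": {
    "full_name": "myorg/example-repo"
  }
}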
Prerequisites to isolate ci/cd workloads homelab
A runner that executes arbitrary CI code has no business sitting on the same network as my media server, where it could reach all my files. I wanted complete isolation.
Proxmox API User and Role Configuration
To accomplish this, I created a new Proxmox user that only has permission to create and remove containers. Go to Datacenter > Permissions > Users and click Add to create the user. Then generate an API token for it; I also checked the privilege separation box so the token can never escalate beyond the user's own permissions.
Here is the equivalent CLI call, which I saved as an Ansible one-liner for later reuse.
pvesh create /access/users/runner-api@pve/token/ephemeral-runner --privsep 1
200 OK
┌───────┬──────────────────────────────────────────────────┐
│ key   │ value                                            │
╞═══════╪══════════════════════════════════════════════════╡
│ token │ PVEToken:runner-api@pve!ephemeral-runner=xxxxxxx │
└───────┴──────────────────────────────────────────────────┘
The newly generated API token gets saved as a Terraform variable. Under Permissions > Add in the Proxmox UI, the user is granted only the PVEVMAdmin role on a dedicated pool; nothing else is permitted. The scope stays narrow and there are no surprises.
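For reference, roughly the same pool and permission scoping can be done from the CLI; the pool name ci-runners below is only an example, so substitute your own.
pvesh create /pools --poolid ci-runners
pveum acl modify /pool/ci-runners --users runner-api@pve --roles PVEVMAdmin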
Configuring Virtual LANs (VLANs) for Runner Isolation
All of my CI/CD containers live on a dedicated VLAN (VLAN 30) that has Internet access but cannot reach my management LAN. On Proxmox I created a VLAN-aware Linux bridge and tagged the LXC network interface for that VLAN. This is where isolating CI/CD workloads from my NAS actually happens.
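For completeness, this is roughly what a VLAN-aware bridge looks like in /etc/network/interfaces on the Proxmox host; eno1 stands in for the physical NIC, so adjust it to your hardware. The container's net0 line (shown later in the template config) then simply carries tag=30.
auto vmbr0
iface vmbr0 inet manual
    bridge-ports eno1
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094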
Generating the GitHub App Private Key for Authentication
Instead of a personal access token, I use a GitHub App so I can subscribe to workflow_job webhook events with fine-grained permissions. In the app's settings I generated a private key and placed it on the Proxmox host next to the webhook listener. The listener uses that key to request an installation access token, which in turn is used to fetch the JIT runner registration token.
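The token exchange itself is a single API call. A rough sketch, assuming $APP_JWT holds a JWT signed with the app's private key and $INSTALLATION_ID is the app's installation ID for the org:
curl -s -X POST \
  -H "Authorization: Bearer $APP_JWT" \
  -H "Accept: application/vnd.github+json" \
  https://api.github.com/app/installations/$INSTALLATION_ID/access_tokens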
Building the proxmox lxc github runner Template
The template is the core of the whole setup. It is just a minimal Ubuntu LXC with Docker and the Actions runner binary installed.
OS Initialization and Unprivileged Container Setup
I started with the standard Proxmox Ubuntu 22.04 template and modified it to run Docker by enabling nesting and keyctl. I had to edit the config file at /etc/pve/lxc/100.conf:
arch: amd64
cores: 2
features: keyctl=1,nesting=1
hostname: runner-template
memory: 2048
net0: name=eth0,bridge=vmbr0,tag=30
ostype: ubuntu
rootfs: local-lvm:vm-100-disk-0,size=20G
swap: 0
unprivileged: 1
lxc.apparmor.profile: unconfined
lxc.cgroup2.devices.allow: c 10:200 rwm
lxc.mount.entry: /dev/net/tun dev/net/tun none bind,create=file
The AppArmor override and the device-allow entries let Docker run inside the unprivileged container without tripping over missing permissions. I have seen many guides skip the lxc.cgroup2 entry and then wonder why docker run hangs.
Installing Dependencies and the Docker Daemon
I booted the container, updated it, and installed Docker with the official installation script. The important part is matching the storage driver to the underlying filesystem: overlay2 requires ext4 or XFS. My template's root filesystem sits on an LVM-thin volume formatted as ext4, so overlay2 works fine here.
apt-get update && apt-get install -y curl uidmap
curl -fsSL https://get.docker.com | sh
cat > /etc/docker/daemon.json <<EOF
{
"storage-driver": "overlay2",
"log-driver": "json-file",
"log-opts": {
"max-size": "10m",
"max-file": "3"
}
}
EOF
systemctl enable docker && systemctl start docker
With that in place, the template has a Docker daemon that:
- starts automatically when the container boots
- will not fill the disk with logs, thanks to the rotation settings above.
Embedding the Actions Runner Binary
I bake the runner tarball into the template so every clone already has the same binary on disk and never needs to download it again.
mkdir -p /home/runner/actions-runner && cd /home/runner/actions-runner
curl -o actions-runner-linux-x64-2.319.1.tar.gz -L https://github.com/actions/runner/releases/download/v2.319.1/actions-runner-linux-x64-2.319.1.tar.gz
tar xzf actions-runner-linux-x64-2.319.1.tar.gz
I don't configure the runner in the template at all; each ephemeral clone runs run.sh with the --jitconfig option and its freshly minted token when the container boots.
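As a sketch of what that boot step looks like, assuming the clone drops the encoded JIT config into /etc/runner-jitconfig (a path I'm using purely for illustration), the first-boot hook boils down to:
#!/bin/bash
# First-boot hook (illustrative): start the runner with the JIT config injected at clone time
cd /home/runner/actions-runner
# --jitconfig registers the runner for exactly one job; the process exits once that job finishes
./run.sh --jitconfig "$(cat /etc/runner-jitconfig)"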
Why I Ultimately Chose This Route
I played around with full VM images in QEMU first. They worked, but cloning a 10 GB VM took almost a minute, and I had to run cloud-init every time. Cloning an LXC takes less than two seconds.
Some of my workflows require running Docker-in-Docker and trying to set this up inside of a VM using nested virtualization was hit-or-miss depending upon the installation environment used. Using unprivileged LXC containers with the correct apparmor profile, I have been able to achieve maximum density for my tiny OptiPlex node. This allows me to run multiple simultaneous runners without issue.
Automating terraform proxmox github actions
I have no desire to run a dashboard or a controller daemon that consumes resources around the clock. The goal is true scale-to-zero.
Configuring the Proxmox Terraform Provider
The Telmate Proxmox provider talks to the Proxmox API natively. The provider.tf file points at the cluster and authenticates with the API token. The Terraform resource block does a full clone of the LXC template, assigns a unique hostname, and injects the JIT config at boot through a cloud-init snippet.
Achieving scale to zero github runners via Webhooks
The system stays completely idle until a webhook from GitHub wakes it up. A small Python Flask app runs on the Proxmox host (as a plain systemd service) and listens for workflow_job events with a status of queued. When one arrives, the app writes a Terraform variable file containing the runner label for the job and runs terraform apply -auto-approve against it. When the job finishes, GitHub sends a completed event and the webhook service runs terraform destroy for that runner instance.
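The Terraform side of that hand-off is nothing fancy; roughly, the listener shells out to something like this (variable names here are illustrative, not the exact ones in my tfvars):
# Spin up a runner for a queued workflow_job
terraform apply -auto-approve \
  -var "runner_name=runner-${JOB_ID}" \
  -var "encoded_jit_config=${JIT_CONFIG}"
# Tear it down once GitHub reports the job as completed
terraform destroy -auto-approve \
  -var "runner_name=runner-${JOB_ID}" \
  -var "encoded_jit_config=${JIT_CONFIG}"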
That’s true scaling to zero with GitHub runners; there are no idle containers, no polling loops, only a webhook waiting for requests. The combination of ephemeral self-hosted GitHub runners and the proxmox approach has managed to keep my power bill from creeping up again.
Evaluating the actions-runner-controller setup Alternative
I spent a weekend testing actions-runner-controller (ARC) on a local k3s cluster. I like it, but it is geared toward people who are already running Kubernetes. Spinning up an extra cluster on Proxmox just to handle CI felt heavy, and ARC requires a persistent manager pod; it's small, but I wanted zero running resources between jobs.
For this homelab, I've chosen to stick with Terraform + webhook rather than ARC. If I ever reach the point where I need a full k8s setup, I'll look at ARC again. For now, cloning a new LXC instance is simpler than running Kubernetes.
Resolving Stale Instances and Deregistration Failures
Things will always break. A job fails in some odd way, the ephemeral runner doesn't self-destruct, and now a zombie container is sitting there holding a lock.
Managing github actions runner token rotation Failures
JIT tokens expire quickly; according to the GitHub docs, they are only good for about 10 minutes. If the LXC clone takes longer than that, the token has already expired by the time the runner tries to register. My webhook listener handles this by generating a fresh JIT token just before Terraform starts executing. If registration still fails, the listener destroys the failed container and tries again. Here is the curl call the listener uses.
curl -X POST -H "Authorization: Bearer $(gh-app-token)" \
-H "Accept: application/vnd.github+json" \
https://api.github.com/orgs/myorg/actions/runners/generate-jitconfig \
-d '{"name":"ephemeral-runner-123","runner_group_id":1,"labels":["lxc-runner"],"work_folder":"_work"}'
{
  "runner": {
    "id": 12345,
    "name": "ephemeral-runner-123",
    "os": "linux",
    "status": "online"
  },
  "encoded_jit_config": "LS0tLS1CRUdJTi...="
}
The encoded_jit_config comes back base64-encoded and gets injected into the container's cloud-init snippet for the runner's run.sh --jitconfig to consume. If the curl returns a 404 or a token error, the listener bails out of the creation rather than leaving a zombie behind.
Sweeping Orphaned Containers via Cron Jobs
Even with care taken to manage the containers, I have had the Proxmox host go down mid-job. The container still exists afterwards, but the runner inside it is never coming back. A cron job that runs every 10 minutes catches these containers and removes them.
#!/bin/bash
# Sweep orphaned ephemeral runner containers (names start with "runner-")
for vmid in $(pct list | awk '/runner-/ {print $1}'); do
  status=$(pct status "$vmid")
  # A runner container only ever lives for one job, so anything not running is an orphan
  if [[ "$status" != *"running"* ]]; then
    echo "Removing offline runner container $vmid"
    pct stop "$vmid" --skiplock 2>/dev/null
    pct destroy "$vmid" 2>/dev/null
  fi
done
The script checks containers whose names start with runner- and removes those containers that are NOT in the running state. It’s not the prettiest, but it has kept my pool clean for several months.
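For completeness, the cron entry itself, assuming the script is saved at /usr/local/bin/sweep-runners.sh (pick whatever path suits you):
# /etc/cron.d/sweep-runners
*/10 * * * * root /usr/local/bin/sweep-runners.sh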
Frequently Asked Questions
How do I fix storage driver issues with Docker inside a Proxmox LXC?
Docker refuses to start if the storage driver does not match the underlying filesystem. In an unprivileged LXC using the overlay2 driver, the filesystem must be ext4 or XFS. If your template sits on ZFS, switch from overlay2 to fuse-overlayfs.
To do that, install the fuse-overlayfs package, add "storage-driver": "fuse-overlayfs" to /etc/docker/daemon.json, and restart Docker.
That lets Docker run inside an unprivileged LXC and clears up the storage driver errors.
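A quick sketch of that change, following the same daemon.json pattern used in the template build above:
apt-get install -y fuse-overlayfs
cat > /etc/docker/daemon.json <<EOF
{
  "storage-driver": "fuse-overlayfs"
}
EOF
systemctl restart docker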
What happens if two CI/CD jobs try to clone the same LXC template simultaneously?
Proxmox's template clone is not atomic across multiple requests, so you will hit a lock contention error. I mitigate this by giving each container a unique name based on the job run ID and having the webhook listener serialize its applies (it also passes -parallelism=1).
Queued jobs are then handled first-in, first-out. A burst of concurrent jobs would have to wait a few seconds each, but I have never actually received concurrent jobs at 3 a.m. in my homelab.
Why does my ephemeral runner show as “offline” in GitHub after the Proxmox clone finishes?
In 9 out of 10 cases, the JIT token expired before the runner could register. Make sure the system clock in the LXC template is synchronized (run timedatectl set-ntp true), and make sure your webhook listener generates the JIT token right before Terraform applies, not the moment it first receives the HTTP request.
Also confirm the container can reach api.github.com outbound through your gateway rules; otherwise VLAN isolation may be what's blocking registration.